basic-memory

Overview Schema Related Servers Score Discussions

basic-memory
specs

SPEC-17 Semantic Search with ChromaDB.md•44.6 KiB

--- title: 'SPEC-17: Semantic Search with ChromaDB' type: spec permalink: specs/spec-17-semantic-search-chromadb tags: - search - chromadb - semantic-search - vector-database - postgres-migration --- # SPEC-17: Semantic Search with ChromaDB Why ChromaDB for Knowledge Management Your users aren't just searching for keywords - they're trying to: - "Find notes related to this concept" - "Show me similar ideas" - "What else did I write about this topic?" Example: # User searches: "AI ethics" # FTS5/MeiliSearch finds: - "AI ethics guidelines" ✅ - "ethical AI development" ✅ - "artificial intelligence" ❌ No keyword match # ChromaDB finds: - "AI ethics guidelines" ✅ - "ethical AI development" ✅ - "artificial intelligence" ✅ Semantic match! - "bias in ML models" ✅ Related concept - "responsible technology" ✅ Similar theme - "neural network fairness" ✅ Connected idea ChromaDB vs MeiliSearch vs Typesense | Feature | ChromaDB | MeiliSearch | Typesense | |------------------|--------------------|--------------------|--------------------| | Semantic Search | ✅ Excellent | ❌ No | ❌ No | | Keyword Search | ⚠️ Via metadata | ✅ Excellent | ✅ Excellent | | Local Deployment | ✅ Embedded mode | ⚠️ Server required | ⚠️ Server required | | No Server Needed | ✅ YES! | ❌ No | ❌ No | | Embedding Cost | ~$0.13/1M tokens | None | None | | Search Speed | 50-200ms | 10-50ms | 10-50ms | | Best For | Semantic discovery | Exact terms | Exact terms | The Killer Feature: Embedded Mode ChromaDB has an embedded client that runs in-process - NO SERVER NEEDED! # Local (FOSS) - ChromaDB embedded in Python process import chromadb client = chromadb.PersistentClient(path="/path/to/chroma_data") collection = client.get_or_create_collection("knowledge_base") # Add documents collection.add( ids=["note1", "note2"], documents=["AI ethics", "Neural networks"], metadatas=[{"type": "note"}, {"type": "spec"}] ) # Search - NO API calls, runs locally! results = collection.query( query_texts=["machine learning"], n_results=10 ) ## Why ### Current Problem: Database Persistence in Cloud In cloud deployments, `memory.db` (SQLite) doesn't persist across Docker container restarts. This means: - Database must be rebuilt on every container restart - Initial sync takes ~49 seconds for 500 files (after optimization in #352) - Users experience delays on each deployment ### Search Architecture Issues Current SQLite FTS5 implementation creates a **dual-implementation problem** for PostgreSQL migration: - FTS5 (SQLite) uses `VIRTUAL TABLE` with `MATCH` queries - PostgreSQL full-text search uses `TSVECTOR` with `@@` operator - These are fundamentally incompatible architectures - Would require **2x search code** and **2x tests** to support both **Example of incompatibility:** ```python # SQLite FTS5 "content_stems MATCH :text" # PostgreSQL "content_vector @@ plainto_tsquery(:text)" ``` ### Search Quality Limitations Current keyword-based FTS5 has limitations: - No semantic understanding (search "AI" doesn't find "machine learning") - No word relationships (search "neural networks" doesn't find "deep learning") - Limited typo tolerance - No relevance ranking beyond keyword matching ### Strategic Goal: PostgreSQL Migration Moving to PostgreSQL (Neon) for cloud deployments would: - ✅ Solve persistence issues (database survives restarts) - ✅ Enable multi-tenant architecture - ✅ Better performance for large datasets - ✅ Support for cloud-native scaling **But requires solving the search compatibility problem.** ## What Migrate from SQLite FTS5 to **ChromaDB** for semantic vector search across all deployments. **Key insight:** ChromaDB is **database-agnostic** - it works with both SQLite and PostgreSQL, eliminating the dual-implementation problem. ### Affected Areas - Search implementation (`src/basic_memory/repository/search_repository.py`) - Search service (`src/basic_memory/services/search_service.py`) - Search models (`src/basic_memory/models/search.py`) - Database initialization (`src/basic_memory/db.py`) - MCP search tools (`src/basic_memory/mcp/tools/search.py`) - Dependencies (`pyproject.toml` - add ChromaDB) - Alembic migrations (FTS5 table removal) - Documentation ### What Changes **Removed:** - SQLite FTS5 virtual table - `MATCH` query syntax - FTS5-specific tokenization and prefix handling - ~300 lines of FTS5 query preparation code **Added:** - ChromaDB persistent client (embedded mode) - Vector embedding generation - Semantic similarity search - Local embedding model (`sentence-transformers`) - Collection management for multi-project support ### What Stays the Same - Search API interface (MCP tools, REST endpoints) - Entity/Observation/Relation indexing workflow - Multi-project isolation - Search filtering by type, date, metadata - Pagination and result formatting - **All SQL queries for exact lookups and metadata filtering** ## Hybrid Architecture: SQL + ChromaDB **Critical Design Decision:** ChromaDB **complements** SQL, it doesn't **replace** it. ### Why Hybrid? ChromaDB is excellent for semantic text search but terrible for exact lookups. SQL is perfect for exact lookups and structured queries. We use both: ``` ┌─────────────────────────────────────────────────┐ │ Search Request │ └─────────────────────────────────────────────────┘ ▼ ┌────────────────────────┐ │ SearchRepository │ │ (Smart Router) │ └────────────────────────┘ ▼ ▼ ┌───────────┐ ┌──────────────┐ │ SQL │ │ ChromaDB │ │ Queries │ │ Semantic │ └───────────┘ └──────────────┘ ▼ ▼ Exact lookups Text search - Permalink - Semantic similarity - Pattern match - Related concepts - Title exact - Typo tolerance - Metadata filter - Fuzzy matching - Date ranges ``` ### When to Use Each #### Use SQL For (Fast & Exact) **Exact Permalink Lookup:** ```python # Find by exact permalink - SQL wins "SELECT * FROM entities WHERE permalink = 'specs/search-feature'" # ~1ms, perfect for exact matches # ChromaDB would be: ~50ms, wasteful ``` **Pattern Matching:** ```python # Find all specs - SQL wins "SELECT * FROM entities WHERE permalink GLOB 'specs/*'" # ~5ms, perfect for wildcards # ChromaDB doesn't support glob patterns ``` **Pure Metadata Queries:** ```python # Find all meetings tagged "important" - SQL wins "SELECT * FROM entities WHERE json_extract(entity_metadata, '$.entity_type') = 'meeting' AND json_extract(entity_metadata, '$.tags') LIKE '%important%'" # ~5ms, structured query # No text search needed, SQL is faster and simpler ``` **Date Filtering:** ```python # Find recent specs - SQL wins "SELECT * FROM entities WHERE entity_type = 'spec' AND created_at > '2024-01-01' ORDER BY created_at DESC" # ~2ms, perfect for structured data ``` #### Use ChromaDB For (Semantic & Fuzzy) **Semantic Content Search:** ```python # Find notes about "neural networks" - ChromaDB wins collection.query(query_texts=["neural networks"]) # Finds: "machine learning", "deep learning", "AI models" # ~50-100ms, semantic understanding # SQL FTS5 would only find exact keyword matches ``` **Text Search + Metadata:** ```python # Find meeting notes about "project planning" tagged "important" collection.query( query_texts=["project planning"], where={ "entity_type": "meeting", "tags": {"$contains": "important"} } ) # ~100ms, semantic search with filters # Finds: "roadmap discussion", "sprint planning", etc. ``` **Typo Tolerance:** ```python # User types "serch feature" (typo) - ChromaDB wins collection.query(query_texts=["serch feature"]) # Still finds: "search feature" documents # ~50-100ms, fuzzy matching # SQL would find nothing ``` ### Performance Comparison | Query Type | SQL | ChromaDB | Winner | |-----------|-----|----------|--------| | Exact permalink | 1-2ms | 50ms | ✅ SQL | | Pattern match (specs/*) | 5-10ms | N/A | ✅ SQL | | Pure metadata filter | 5ms | 50ms | ✅ SQL | | Semantic text search | ❌ Can't | 50-100ms | ✅ ChromaDB | | Text + metadata | ❌ Keywords only | 100ms | ✅ ChromaDB | | Typo tolerance | ❌ Can't | 50ms | ✅ ChromaDB | ### Metadata/Frontmatter Handling **Both systems support full frontmatter filtering!** #### SQL Metadata Storage ```python # Entities table stores frontmatter as JSON CREATE TABLE entities ( id INTEGER PRIMARY KEY, title TEXT, permalink TEXT, file_path TEXT, entity_type TEXT, entity_metadata JSON, -- All frontmatter here! created_at DATETIME, ... ) # Query frontmatter fields SELECT * FROM entities WHERE json_extract(entity_metadata, '$.entity_type') = 'meeting' AND json_extract(entity_metadata, '$.tags') LIKE '%important%' AND json_extract(entity_metadata, '$.status') = 'completed' ``` #### ChromaDB Metadata Storage ```python # When indexing, store ALL frontmatter as metadata class ChromaSearchBackend: async def index_entity(self, entity: Entity): """Index with complete frontmatter metadata.""" # Extract ALL frontmatter fields metadata = { "entity_id": entity.id, "project_id": entity.project_id, "permalink": entity.permalink, "file_path": entity.file_path, "entity_type": entity.entity_type, "type": "entity", # ALL frontmatter tags "tags": entity.entity_metadata.get("tags", []), # Custom frontmatter fields "status": entity.entity_metadata.get("status"), "priority": entity.entity_metadata.get("priority"), # Spread any other custom fields **{k: v for k, v in entity.entity_metadata.items() if k not in ["tags", "entity_type"]} } self.collection.upsert( ids=[f"entity_{entity.id}_{entity.project_id}"], documents=[self._format_document(entity)], metadatas=[metadata] # Full frontmatter! ) ``` #### ChromaDB Metadata Queries ChromaDB supports rich filtering: ```python # Simple filter - single field collection.query( query_texts=["project planning"], where={"entity_type": "meeting"} ) # Multiple conditions (AND) collection.query( query_texts=["architecture decisions"], where={ "entity_type": "spec", "tags": {"$contains": "important"} } ) # Complex filters with operators collection.query( query_texts=["machine learning"], where={ "$and": [ {"entity_type": {"$in": ["note", "spec"]}}, {"tags": {"$contains": "AI"}}, {"created_at": {"$gt": "2024-01-01"}}, {"status": "in-progress"} ] } ) # Multiple tags (all must match) collection.query( query_texts=["cloud architecture"], where={ "$and": [ {"tags": {"$contains": "architecture"}}, {"tags": {"$contains": "cloud"}} ] } ) ``` ### Smart Routing Implementation ```python class SearchRepository: def __init__( self, session_maker: async_sessionmaker[AsyncSession], project_id: int, chroma_backend: ChromaSearchBackend ): self.sql = session_maker # Keep SQL! self.chroma = chroma_backend self.project_id = project_id async def search( self, search_text: Optional[str] = None, permalink: Optional[str] = None, permalink_match: Optional[str] = None, title: Optional[str] = None, types: Optional[List[str]] = None, tags: Optional[List[str]] = None, after_date: Optional[datetime] = None, custom_metadata: Optional[dict] = None, limit: int = 10, offset: int = 0, ) -> List[SearchIndexRow]: """Smart routing between SQL and ChromaDB.""" # ========================================== # Route 1: Exact Lookups → SQL (1-5ms) # ========================================== if permalink: # Exact permalink: "specs/search-feature" return await self._sql_permalink_lookup(permalink) if permalink_match: # Pattern match: "specs/*" return await self._sql_pattern_match(permalink_match) if title and not search_text: # Exact title lookup (no semantic search needed) return await self._sql_title_match(title) # ========================================== # Route 2: Pure Metadata → SQL (5-10ms) # ========================================== # No text search, just filtering by metadata if not search_text and (types or tags or after_date or custom_metadata): return await self._sql_metadata_filter( types=types, tags=tags, after_date=after_date, custom_metadata=custom_metadata, limit=limit, offset=offset ) # ========================================== # Route 3: Text Search → ChromaDB (50-100ms) # ========================================== if search_text: # Build ChromaDB metadata filters where_filters = self._build_chroma_filters( types=types, tags=tags, after_date=after_date, custom_metadata=custom_metadata ) # Semantic search with metadata filtering return await self.chroma.search( query_text=search_text, project_id=self.project_id, where=where_filters, limit=limit ) # ========================================== # Route 4: List All → SQL (2-5ms) # ========================================== return await self._sql_list_entities( limit=limit, offset=offset ) def _build_chroma_filters( self, types: Optional[List[str]] = None, tags: Optional[List[str]] = None, after_date: Optional[datetime] = None, custom_metadata: Optional[dict] = None ) -> dict: """Build ChromaDB where clause from filters.""" filters = {"project_id": self.project_id} # Type filtering if types: if len(types) == 1: filters["entity_type"] = types[0] else: filters["entity_type"] = {"$in": types} # Tag filtering (array contains) if tags: if len(tags) == 1: filters["tags"] = {"$contains": tags[0]} else: # Multiple tags - all must match filters = { "$and": [ filters, *[{"tags": {"$contains": tag}} for tag in tags] ] } # Date filtering if after_date: filters["created_at"] = {"$gt": after_date.isoformat()} # Custom frontmatter fields if custom_metadata: filters.update(custom_metadata) return filters async def _sql_metadata_filter( self, types: Optional[List[str]] = None, tags: Optional[List[str]] = None, after_date: Optional[datetime] = None, custom_metadata: Optional[dict] = None, limit: int = 10, offset: int = 0 ) -> List[SearchIndexRow]: """Pure metadata queries using SQL.""" conditions = ["project_id = :project_id"] params = {"project_id": self.project_id} if types: type_list = ", ".join(f"'{t}'" for t in types) conditions.append(f"entity_type IN ({type_list})") if tags: # Check each tag for i, tag in enumerate(tags): param_name = f"tag_{i}" conditions.append( f"json_extract(entity_metadata, '$.tags') LIKE :{param_name}" ) params[param_name] = f"%{tag}%" if after_date: conditions.append("created_at > :after_date") params["after_date"] = after_date if custom_metadata: for key, value in custom_metadata.items(): param_name = f"meta_{key}" conditions.append( f"json_extract(entity_metadata, '$.{key}') = :{param_name}" ) params[param_name] = value where = " AND ".join(conditions) sql = f""" SELECT * FROM entities WHERE {where} ORDER BY created_at DESC LIMIT :limit OFFSET :offset """ params["limit"] = limit params["offset"] = offset async with db.scoped_session(self.session_maker) as session: result = await session.execute(text(sql), params) return self._format_sql_results(result) ``` ### Real-World Examples #### Example 1: Pure Metadata Query (No Text) ```python # "Find all meetings tagged 'important'" results = await search_repo.search( types=["meeting"], tags=["important"] ) # Routing: → SQL (~5ms) # SQL: SELECT * FROM entities # WHERE entity_type = 'meeting' # AND json_extract(entity_metadata, '$.tags') LIKE '%important%' ``` #### Example 2: Semantic Search (No Metadata) ```python # "Find notes about neural networks" results = await search_repo.search( search_text="neural networks" ) # Routing: → ChromaDB (~80ms) # Finds: "machine learning", "deep learning", "AI models", etc. ``` #### Example 3: Semantic + Metadata ```python # "Find meeting notes about 'project planning' tagged 'important'" results = await search_repo.search( search_text="project planning", types=["meeting"], tags=["important"] ) # Routing: → ChromaDB with filters (~100ms) # ChromaDB: query_texts=["project planning"] # where={"entity_type": "meeting", # "tags": {"$contains": "important"}} # Finds: "roadmap discussion", "sprint planning", etc. ``` #### Example 4: Complex Frontmatter Query ```python # "Find in-progress specs with multiple tags, recent" results = await search_repo.search( types=["spec"], tags=["architecture", "cloud"], after_date=datetime(2024, 1, 1), custom_metadata={"status": "in-progress"} ) # Routing: → SQL (~10ms) # No text search, pure structured query - SQL is faster ``` #### Example 5: Semantic + Complex Metadata ```python # "Find notes about 'authentication' that are in-progress" results = await search_repo.search( search_text="authentication", custom_metadata={"status": "in-progress", "priority": "high"} ) # Routing: → ChromaDB with metadata filters (~100ms) # Semantic search for "authentication" concept # Filters by status and priority in metadata ``` #### Example 6: Exact Permalink ```python # "Show me specs/search-feature" results = await search_repo.search( permalink="specs/search-feature" ) # Routing: → SQL (~1ms) # SQL: SELECT * FROM entities WHERE permalink = 'specs/search-feature' ``` #### Example 7: Pattern Match ```python # "Show me all specs" results = await search_repo.search( permalink_match="specs/*" ) # Routing: → SQL (~5ms) # SQL: SELECT * FROM entities WHERE permalink GLOB 'specs/*' ``` ### What We Remove vs Keep **REMOVE (FTS5-specific):** - ❌ `CREATE VIRTUAL TABLE search_index USING fts5(...)` - ❌ `MATCH` operator queries - ❌ FTS5 tokenization configuration - ❌ ~300 lines of FTS5 query preparation code - ❌ Trigram generation and prefix handling **KEEP (Standard SQL):** - ✅ `SELECT * FROM entities WHERE permalink = :permalink` - ✅ `SELECT * FROM entities WHERE permalink GLOB :pattern` - ✅ `SELECT * FROM entities WHERE title LIKE :title` - ✅ `SELECT * FROM entities WHERE json_extract(entity_metadata, ...) = :value` - ✅ All date filtering, pagination, sorting - ✅ Entity table structure and indexes **ADD (ChromaDB):** - ✅ ChromaDB persistent client (embedded) - ✅ Semantic vector search - ✅ Metadata filtering in ChromaDB - ✅ Smart routing logic ## How (High Level) ### Architecture Overview ``` ┌─────────────────────────────────────────────────────────────┐ │ FOSS Deployment (Local) │ ├─────────────────────────────────────────────────────────────┤ │ SQLite (data) + ChromaDB embedded (search) │ │ - No external services │ │ - Local embedding model (sentence-transformers) │ │ - Persists in ~/.basic-memory/chroma_data/ │ └─────────────────────────────────────────────────────────────┘ ┌─────────────────────────────────────────────────────────────┐ │ Cloud Deployment (Multi-tenant) │ ├─────────────────────────────────────────────────────────────┤ │ PostgreSQL/Neon (data) + ChromaDB server (search) │ │ - Neon serverless Postgres for persistence │ │ - ChromaDB server in Docker container │ │ - Optional: OpenAI embeddings for better quality │ └─────────────────────────────────────────────────────────────┘ ``` ### Phase 1: ChromaDB Integration (2-3 days) #### 1. Add ChromaDB Dependency ```toml # pyproject.toml dependencies = [ "chromadb>=0.4.0", "sentence-transformers>=2.2.0", # Local embeddings ] ``` #### 2. Create ChromaSearchBackend ```python # src/basic_memory/search/chroma_backend.py from chromadb import PersistentClient from chromadb.utils import embedding_functions class ChromaSearchBackend: def __init__( self, persist_directory: Path, collection_name: str = "knowledge_base", embedding_model: str = "all-MiniLM-L6-v2" ): """Initialize ChromaDB with local embeddings.""" self.client = PersistentClient(path=str(persist_directory)) # Use local sentence-transformers model (no API costs) self.embed_fn = embedding_functions.SentenceTransformerEmbeddingFunction( model_name=embedding_model ) self.collection = self.client.get_or_create_collection( name=collection_name, embedding_function=self.embed_fn, metadata={"hnsw:space": "cosine"} # Similarity metric ) async def index_entity(self, entity: Entity): """Index entity with automatic embeddings.""" # Combine title and content for semantic search document = self._format_document(entity) self.collection.upsert( ids=[f"entity_{entity.id}_{entity.project_id}"], documents=[document], metadatas=[{ "entity_id": entity.id, "project_id": entity.project_id, "permalink": entity.permalink, "file_path": entity.file_path, "entity_type": entity.entity_type, "type": "entity", }] ) async def search( self, query_text: str, project_id: int, limit: int = 10, filters: dict = None ) -> List[SearchResult]: """Semantic search with metadata filtering.""" where = {"project_id": project_id} if filters: where.update(filters) results = self.collection.query( query_texts=[query_text], n_results=limit, where=where ) return self._format_results(results) ``` #### 3. Update SearchRepository ```python # src/basic_memory/repository/search_repository.py class SearchRepository: def __init__( self, session_maker: async_sessionmaker[AsyncSession], project_id: int, chroma_backend: ChromaSearchBackend ): self.session_maker = session_maker self.project_id = project_id self.chroma = chroma_backend async def search( self, search_text: Optional[str] = None, permalink: Optional[str] = None, # ... other filters ) -> List[SearchIndexRow]: """Search using ChromaDB for text, SQL for exact lookups.""" # For exact permalink/pattern matches, use SQL if permalink or permalink_match: return await self._sql_exact_search(...) # For text search, use ChromaDB semantic search if search_text: results = await self.chroma.search( query_text=search_text, project_id=self.project_id, limit=limit, filters=self._build_filters(types, after_date, ...) ) return results # Fallback to listing all return await self._list_entities(...) ``` #### 4. Update SearchService ```python # src/basic_memory/services/search_service.py class SearchService: def __init__( self, search_repository: SearchRepository, entity_repository: EntityRepository, file_service: FileService, chroma_backend: ChromaSearchBackend, ): self.repository = search_repository self.entity_repository = entity_repository self.file_service = file_service self.chroma = chroma_backend async def index_entity(self, entity: Entity): """Index entity in ChromaDB.""" if entity.is_markdown: await self._index_entity_markdown(entity) else: await self._index_entity_file(entity) async def _index_entity_markdown(self, entity: Entity): """Index markdown entity with full content.""" # Index entity await self.chroma.index_entity(entity) # Index observations (as separate documents) for obs in entity.observations: await self.chroma.index_observation(obs, entity) # Index relations (metadata only) for rel in entity.outgoing_relations: await self.chroma.index_relation(rel, entity) ``` ### Phase 2: PostgreSQL Support (1 day) #### 1. Add PostgreSQL Database Type ```python # src/basic_memory/db.py class DatabaseType(Enum): MEMORY = auto() FILESYSTEM = auto() POSTGRESQL = auto() # NEW @classmethod def get_db_url(cls, db_path_or_url: str, db_type: "DatabaseType") -> str: if db_type == cls.POSTGRESQL: return db_path_or_url # Neon connection string elif db_type == cls.MEMORY: return "sqlite+aiosqlite://" return f"sqlite+aiosqlite:///{db_path_or_url}" ``` #### 2. Update Connection Handling ```python def _create_engine_and_session(...): db_url = DatabaseType.get_db_url(db_path_or_url, db_type) if db_type == DatabaseType.POSTGRESQL: # Use asyncpg driver for Postgres engine = create_async_engine( db_url, pool_size=10, max_overflow=20, pool_pre_ping=True, # Health checks ) else: # SQLite configuration engine = create_async_engine(db_url, connect_args=connect_args) # Only configure SQLite-specific settings for SQLite if db_type != DatabaseType.MEMORY: @event.listens_for(engine.sync_engine, "connect") def enable_wal_mode(dbapi_conn, connection_record): _configure_sqlite_connection(dbapi_conn, enable_wal=True) return engine, async_sessionmaker(engine, expire_on_commit=False) ``` #### 3. Remove SQLite-Specific Code ```python # Remove from scoped_session context manager: # await session.execute(text("PRAGMA foreign_keys=ON")) # DELETE # PostgreSQL handles foreign keys by default ``` ### Phase 3: Migration & Testing (1-2 days) #### 1. Create Migration Script ```python # scripts/migrate_to_chromadb.py async def migrate_fts5_to_chromadb(): """One-time migration from FTS5 to ChromaDB.""" # 1. Read all entities from database entities = await entity_repository.find_all() # 2. Index in ChromaDB for entity in entities: await search_service.index_entity(entity) # 3. Drop FTS5 table (Alembic migration) await session.execute(text("DROP TABLE IF EXISTS search_index")) ``` #### 2. Update Tests - Replace FTS5 test fixtures with ChromaDB fixtures - Test semantic search quality - Test multi-project isolation in ChromaDB - Benchmark performance vs FTS5 #### 3. Documentation Updates - Update search documentation - Add ChromaDB configuration guide - Document embedding model options - PostgreSQL deployment guide ### Configuration ```python # config.py class BasicMemoryConfig: # Database database_type: DatabaseType = DatabaseType.FILESYSTEM database_path: Path = Path.home() / ".basic-memory" / "memory.db" database_url: Optional[str] = None # For Postgres: postgresql://... # Search chroma_persist_directory: Path = Path.home() / ".basic-memory" / "chroma_data" embedding_model: str = "all-MiniLM-L6-v2" # Local model embedding_provider: str = "local" # or "openai" openai_api_key: Optional[str] = None # For cloud deployments ``` ### Deployment Configurations #### Local (FOSS) ```yaml # Default configuration database_type: FILESYSTEM database_path: ~/.basic-memory/memory.db chroma_persist_directory: ~/.basic-memory/chroma_data embedding_model: all-MiniLM-L6-v2 embedding_provider: local ``` #### Cloud (Docker Compose) ```yaml services: postgres: image: postgres:15 environment: POSTGRES_DB: basic_memory POSTGRES_PASSWORD: ${DB_PASSWORD} chromadb: image: chromadb/chroma:latest volumes: - chroma_data:/chroma/chroma environment: ALLOW_RESET: true app: environment: DATABASE_TYPE: POSTGRESQL DATABASE_URL: postgresql://postgres:${DB_PASSWORD}@postgres/basic_memory CHROMA_HOST: chromadb CHROMA_PORT: 8000 EMBEDDING_PROVIDER: local # or openai ``` ## How to Evaluate ### Success Criteria #### Functional Requirements - ✅ Semantic search finds related concepts (e.g., "AI" finds "machine learning") - ✅ Exact permalink/pattern matches work (e.g., `specs/*`) - ✅ Multi-project isolation maintained - ✅ All existing search filters work (type, date, metadata) - ✅ MCP tools continue to work without changes - ✅ Works with both SQLite and PostgreSQL #### Performance Requirements - ✅ Search latency < 200ms for 1000 documents (local embedding) - ✅ Indexing time comparable to FTS5 (~10 files/sec) - ✅ Initial sync time not significantly worse than current - ✅ Memory footprint < 1GB for local deployments #### Quality Requirements - ✅ Better search relevance than FTS5 keyword matching - ✅ Handles typos and word variations - ✅ Finds semantically similar content #### Deployment Requirements - ✅ FOSS: Works out-of-box with no external services - ✅ Cloud: Integrates with PostgreSQL (Neon) - ✅ No breaking changes to MCP API - ✅ Migration script for existing users ### Testing Procedure #### 1. Unit Tests ```bash # Test ChromaDB backend pytest tests/test_chroma_backend.py # Test search repository with ChromaDB pytest tests/test_search_repository.py # Test search service pytest tests/test_search_service.py ``` #### 2. Integration Tests ```bash # Test full search workflow pytest test-int/test_search_integration.py # Test with PostgreSQL DATABASE_TYPE=POSTGRESQL pytest test-int/ ``` #### 3. Semantic Search Quality Tests ```python # Test semantic similarity search("machine learning") should find: - "neural networks" - "deep learning" - "AI algorithms" search("software architecture") should find: - "system design" - "design patterns" - "microservices" ``` #### 4. Performance Benchmarks ```bash # Run search benchmarks pytest test-int/test_search_performance.py -v # Measure: - Search latency (should be < 200ms) - Indexing throughput (should be ~10 files/sec) - Memory usage (should be < 1GB) ``` #### 5. Migration Testing ```bash # Test migration from FTS5 to ChromaDB python scripts/migrate_to_chromadb.py # Verify all entities indexed # Verify search results quality # Verify no data loss ``` ### Metrics **Search Quality:** - Semantic relevance score (manual evaluation) - Precision/recall for common queries - User satisfaction (qualitative) **Performance:** - Average search latency (ms) - P95/P99 search latency - Indexing throughput (files/sec) - Memory usage (MB) **Deployment:** - Local deployment success rate - Cloud deployment success rate - Migration success rate ## Implementation Checklist ### Phase 1: ChromaDB Integration - [ ] Add ChromaDB and sentence-transformers dependencies - [ ] Create ChromaSearchBackend class - [ ] Update SearchRepository to use ChromaDB - [ ] Update SearchService indexing methods - [ ] Remove FTS5 table creation code - [ ] Update search query logic - [ ] Add ChromaDB configuration to BasicMemoryConfig ### Phase 2: PostgreSQL Support - [ ] Add DatabaseType.POSTGRESQL enum - [ ] Update get_db_url() for Postgres connection strings - [ ] Add asyncpg dependency - [ ] Update engine creation for Postgres - [ ] Remove SQLite-specific PRAGMA statements - [ ] Test with Neon database ### Phase 3: Testing & Migration - [ ] Write unit tests for ChromaSearchBackend - [ ] Update search integration tests - [ ] Add semantic search quality tests - [ ] Create performance benchmarks - [ ] Write migration script from FTS5 - [ ] Test migration with existing data - [ ] Update documentation ### Phase 4: Deployment - [ ] Update docker-compose.yml for cloud - [ ] Document local FOSS deployment - [ ] Document cloud PostgreSQL deployment - [ ] Create migration guide for users - [ ] Update MCP tool documentation ## Notes ### Embedding Model Trade-offs **Local Model: `all-MiniLM-L6-v2`** - Size: 80MB download - Speed: ~50ms embedding time - Dimensions: 384 - Cost: $0 - Quality: Good for general knowledge - Best for: FOSS deployments **OpenAI: `text-embedding-3-small`** - Speed: ~100-200ms (API call) - Dimensions: 1536 - Cost: ~$0.13 per 1M tokens (~$0.01 per 1000 notes) - Quality: Excellent - Best for: Cloud deployments with budget ### ChromaDB Storage ChromaDB stores data in: ``` ~/.basic-memory/chroma_data/ ├── chroma.sqlite3 # Metadata ├── index/ # HNSW indexes └── collections/ # Vector data ``` Typical sizes: - 100 notes: ~5MB - 1000 notes: ~50MB - 10000 notes: ~500MB ### Why Not Keep FTS5? **Considered:** Hybrid approach (FTS5 for SQLite + tsvector for Postgres) **Rejected because:** - 2x the code to maintain - 2x the tests to write - 2x the bugs to fix - Inconsistent search behavior between deployments - ChromaDB provides better search quality anyway **ChromaDB wins:** - One implementation for both databases - Better search quality (semantic!) - Database-agnostic architecture - Embedded mode for FOSS (no servers needed) ## implementation Proposed Architecture Option 1: ChromaDB Only (Simplest) class ChromaSearchBackend: def __init__(self, path: str, embedding_model: str = "all-MiniLM-L6-v2"):yes # For local: embedded client (no server!) self.client = chromadb.PersistentClient(path=path) # Use local embedding model (no API costs!) from chromadb.utils import embedding_functions self.embed_fn = embedding_functions.SentenceTransformerEmbeddingFunction( model_name=embedding_model ) self.collection = self.client.get_or_create_collection( name="knowledge_base", embedding_function=self.embed_fn ) async def index_entity(self, entity: Entity): # ChromaDB handles embeddings automatically! self.collection.upsert( ids=[str(entity.id)], documents=[f"{entity.title}\n{entity.content}"], metadatas=[{ "permalink": entity.permalink, "type": entity.entity_type, "file_path": entity.file_path }] ) async def search(self, query: str, filters: dict = None): # Semantic search with optional metadata filters results = self.collection.query( query_texts=[query], n_results=10, where=filters # e.g., {"type": "note"} ) return results Deployment: - Local (FOSS): ChromaDB embedded, local embedding model, NO servers - Cloud: ChromaDB server OR still embedded (it's just a Python lib!) Option 2: Hybrid FTS + ChromaDB (Best UX) class HybridSearchBackend: def __init__(self): self.fts = SQLiteFTS5Backend() # Fast keyword search self.chroma = ChromaSearchBackend() # Semantic search async def search(self, query: str, search_type: str = "auto"): if search_type == "exact": # User wants exact match: "specs/search-feature" return await self.fts.search(query) elif search_type == "semantic": # User wants related concepts return await self.chroma.search(query) else: # "auto" # Check if query looks like exact match if "/" in query or query.startswith('"'): return await self.fts.search(query) # Otherwise use semantic search return await self.chroma.search(query) Embedding Options Option A: Local Model (FREE, FOSS-friendly) # Uses sentence-transformers (runs locally) # Model: ~100MB download # Speed: ~50-100ms for embedding # Cost: $0 from chromadb.utils import embedding_functions embed_fn = embedding_functions.SentenceTransformerEmbeddingFunction( model_name="all-MiniLM-L6-v2" # Fast, accurate, free ) Option B: OpenAI Embeddings (Cloud only) # For cloud users who want best quality # Model: text-embedding-3-small # Speed: ~100-200ms via API # Cost: ~$0.13 per 1M tokens (~$0.01 per 1000 notes) embed_fn = embedding_functions.OpenAIEmbeddingFunction( api_key="...", model_name="text-embedding-3-small" ) Performance Comparison Local embedding model: all-MiniLM-L6-v2 Embedding time: ~50ms per note Search time: ~100ms for 1000 notes Memory: ~500MB (model + ChromaDB) Cost: $0 Quality: Good (384 dimensions) OpenAI embeddings: text-embedding-3-small Embedding time: ~100-200ms per note (API call) Search time: ~50ms for 1000 notes Cost: ~$0.01 per 1000 notes Quality: Excellent (1536 dimensions) My Recommendation: ChromaDB with Local Embeddings Here's the plan: Phase 1: Local ChromaDB (1-2 days) # FOSS version - SQLite for data persistence - ChromaDB embedded for semantic search - Local embedding model (no API costs) - NO external services required Benefits: - ✅ Same deployment as current (just Python package) - ✅ Semantic search for better UX - ✅ Free embeddings with local model - ✅ No servers needed Phase 2: Postgres + ChromaDB Cloud (1-2 days) # Cloud version - Postgres for data persistence - ChromaDB server for semantic search - OpenAI embeddings (higher quality) - OR keep local embeddings (cheaper) Phase 3: Hybrid Search (optional, 1 day) # Add FTS for exact matches alongside ChromaDB - Quick keyword search when needed - Semantic search for exploration - Best of both worlds Code Estimate Just ChromaDB (replacing FTS5): - Remove FTS5 code: 2 hours - Add ChromaDB backend: 4 hours - Update search service: 2 hours - Testing: 4 hours - Total: 1.5 days ChromaDB + Postgres migration: - Add Postgres support: 4 hours - Test with Neon: 2 hours - Total: +0.75 days Grand total: 2-3 days for complete migration The Kicker ChromaDB solves BOTH problems: 1. ✅ Works with SQLite AND Postgres (it's separate!) 2. ✅ No server needed for local (embedded mode) 3. ✅ Better search than FTS5 (semantic!) 4. ✅ One implementation for both deployments Want me to prototype this? I can show you: 1. ChromaDB embedded with local embeddings 2. Example searches showing semantic matching 3. Performance benchmarks 4. Migration from FTS5 ## Observations - [problem] SQLite FTS5 and PostgreSQL tsvector are incompatible architectures requiring dual implementation #database-compatibility - [problem] Cloud deployments lose database on container restart requiring full re-sync #persistence - [solution] ChromaDB provides database-agnostic semantic search eliminating dual implementation #architecture - [advantage] Semantic search finds related concepts beyond keyword matching improving UX #search-quality - [deployment] Embedded ChromaDB requires no external services for FOSS #simplicity - [migration] Moving to PostgreSQL solves cloud persistence issues #cloud-architecture - [performance] Local embedding models provide good quality at zero cost #cost-optimization - [trade-off] Embedding generation adds ~50ms latency vs instant FTS5 indexing #performance - [benefit] Single search codebase reduces maintenance burden and test coverage needs #maintainability ## Prior Art / References ### Community Fork: manuelbliemel/basic-memory (feature/vector-search) **Repository**: https://github.com/manuelbliemel/basic-memory/tree/feature/vector-search **Key Implementation Details**: **Vector Database**: ChromaDB (same as our approach!) **Embedding Models**: - Local: `all-MiniLM-L6-v2` (default, 384 dims) - same model we planned - Also supports: `all-mpnet-base-v2`, `paraphrase-MiniLM-L6-v2`, `multi-qa-MiniLM-L6-cos-v1` - OpenAI: `text-embedding-ada-002`, `text-embedding-3-small`, `text-embedding-3-large` **Chunking Strategy** (interesting - we didn't consider this): - Chunk Size: 500 characters - Chunk Overlap: 50 characters - Breaks documents into smaller pieces for better semantic search **Search Strategies**: 1. `fuzzy_only` (default) - FTS5 only 2. `vector_only` - ChromaDB only 3. `hybrid` (recommended) - Both FTS5 + ChromaDB 4. `fuzzy_primary` - FTS5 first, ChromaDB fallback 5. `vector_primary` - ChromaDB first, FTS5 fallback **Configuration**: - Similarity Threshold: 0.1 - Max Results: 5 - Storage: `~/.basic-memory/chroma/` - Config: `~/.basic-memory/config.json` **Key Differences from Our Approach**: | Aspect | Their Approach | Our Approach | |--------|---------------|--------------| | FTS5 | Keep FTS5 + add ChromaDB | Remove FTS5, use SQL for exact lookups | | Search Strategy | 5 configurable strategies | Smart routing (automatic) | | Document Processing | Chunk into 500-char pieces | Index full documents | | Hybrid Mode | Run both, merge, dedupe | Route to best backend | | Configuration | User-configurable strategy | Automatic based on query type | **What We Can Learn**: 1. **Chunking**: Breaking documents into 500-character chunks with 50-char overlap may improve semantic search quality for long documents - Pro: Better granularity for semantic matching - Con: More vectors to store and search - Consider: Optional chunking for large documents (>2000 chars) 2. **Configurable Strategies**: Allowing users to choose search strategy provides flexibility - Pro: Power users can tune behavior - Con: More complexity, most users won't configure - Consider: Default to smart routing, allow override via config 3. **Similarity Threshold**: They use 0.1 as default - Consider: Benchmark different thresholds for quality 4. **Storage Location**: `~/.basic-memory/chroma/` matches our planned `chroma_data/` approach **Potential Collaboration**: - Their implementation is nearly complete as a fork - Could potentially merge their work or use as reference implementation - Their chunking strategy could be valuable addition to our approach ## Relations - implements [[SPEC-11 Basic Memory API Performance Optimization]] - relates_to [[Performance Optimizations Documentation]] - enables [[PostgreSQL Migration]] - improves_on [[SQLite FTS5 Search]] - references [[manuelbliemel/basic-memory feature/vector-search fork]]

Loading blob content...

Latest Blog Posts

Redis vs ioredis vs valkey-glide
By punkpeye on January 26, 2026.
benchmark
Redis
valkey
Quickstart: Publish an MCP Server to the MCP Registry
By punkpeye on January 24, 2026.
mcp
official reference mirror
Official MCP Registry Server.json Requirements
By punkpeye on January 24, 2026.
mcp
official reference mirror

MCP directory API

We provide all the information about MCP servers via our MCP API.

curl -X GET 'https://glama.ai/api/mcp/v1/servers/basicmachines-co/basic-memory'

If you have feedback or need assistance with the MCP directory API, please join our Discord server

SPEC-17 Semantic Search with ChromaDB.md•44.6 KiB